# Load plyr before dplyr so that dplyr's verbs (mutate, summarise, ...) are not masked by plyr
library(plyr)
library(dplyr)
library(ggplot2)
library(patchwork)
library(treemapify)
library(reshape2)
library(tidyverse)
library(plotly)
library(maps)
usa_deaths_states = read.csv("data/death-causes-usa.csv", sep=";") %>% filter(Cause.Name != "All causes")
spain_deaths = read.csv("data/death-causes-spain-2017-modified.csv", sep=";",fileEncoding="UTF-8-BOM")
The first data set, published by the Centers for Disease Control and Prevention, was gathered by the National Center for Health Statistics (NCHS) and was last revised in 2017. It contains information on the 10 leading causes of death in the United States, based on resident death certificates filed in the 50 states and the District of Columbia, which record demographic and medical characteristics. The data set holds 9880 observations, each with the following 6 features:
dim(usa_deaths_states)
## [1] 9880 6
head(usa_deaths_states)
## Year X113.Cause.Name
## 1 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 2 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 3 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 4 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 5 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 6 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## Cause.Name State Deaths Age.adjusted.Death.Rate
## 1 Unintentional injuries United States 169936 49.4
## 2 Unintentional injuries Alabama 2703 53.8
## 3 Unintentional injuries Alaska 436 63.7
## 4 Unintentional injuries Arizona 4184 56.2
## 5 Unintentional injuries Arkansas 1625 51.8
## 6 Unintentional injuries California 13840 33.2
The second data set details the causes of death in Spain for the year 2017. It was found on the government website, https://datos.gob.es/, although it is no longer available at that location since the original project proposal. This data was collected as part of a study done by the Spanish National Statistics Institute (INE). Because of the layout of the data, it had to be transformed before it could be processed. During this process the list of disease types was reduced by combining entries of the same type into one, so that it could be compared with the USA data set. For example, the Spanish data set lists 30 types of cancer while the USA data set has one, so the Spanish counts were summed under the name “Cancer”. The resulting data consists of 1056 observations with the following 4 features:
dim(spain_deaths)
## [1] 1056 4
head(spain_deaths)
## DISEASE GENDER AGE NUMBER.OF.DEATHS
## 1 All causes Both All ages 424523
## 2 All causes Males All ages 214236
## 3 All causes Females All ages 210287
## 4 All causes Both 0 to 1 1092
## 5 All causes Males 0 to 1 619
## 6 All causes Females 0 to 1 473
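The consolidation step described above (summing several subtypes under a single name) could look roughly like the following. This is only a sketch: the subtype names, the counts, and the `grepl("neoplasm", ...)` matching rule are made up for illustration and are not the mapping actually used to build the modified file.

```r
library(dplyr)

# Hypothetical rows in the layout of the Spanish data set.
# Subtype names and counts are invented for illustration only.
raw_spain <- data.frame(
  DISEASE = c("Malignant neoplasm of stomach",
              "Malignant neoplasm of colon",
              "Diabetes mellitus"),
  GENDER = "Both",
  AGE = "All ages",
  NUMBER.OF.DEATHS = c(100, 250, 80),
  stringsAsFactors = FALSE
)

# Relabel every cancer subtype as "Cancer" (assumed matching rule),
# then sum the counts within each disease / gender / age group.
# dplyr:: prefixes guard against masking if plyr is attached after dplyr.
collapsed <- raw_spain %>%
  dplyr::mutate(DISEASE = ifelse(grepl("neoplasm", DISEASE, ignore.case = TRUE),
                                 "Cancer", DISEASE)) %>%
  dplyr::group_by(DISEASE, GENDER, AGE) %>%
  dplyr::summarise(NUMBER.OF.DEATHS = sum(NUMBER.OF.DEATHS), .groups = "drop")

print(collapsed)
```

Grouping by gender and age as well as disease keeps the breakdowns that the later plots rely on.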
As we can see, the two data sets are structured differently, so the comparison will be difficult. We first reduce the US data set to country-level data by keeping only the rows for the United States as a whole and dropping the individual states.
usa_deaths = usa_deaths_states %>% filter(State == "United States")
head(usa_deaths)
## Year X113.Cause.Name
## 1 2017 Accidents (unintentional injuries) (V01-X59,Y85-Y86)
## 2 2017 Alzheimer's disease (G30)
## 3 2017 Cerebrovascular diseases (I60-I69)
## 4 2017 Chronic lower respiratory diseases (J40-J47)
## 5 2017 Diabetes mellitus (E10-E14)
## 6 2017 Diseases of heart (I00-I09,I11,I13,I20-I51)
## Cause.Name State Deaths Age.adjusted.Death.Rate
## 1 Unintentional injuries United States 169936 49.4
## 2 Alzheimer's disease United States 121404 31.0
## 3 Stroke United States 146383 37.6
## 4 CLRD United States 160201 40.9
## 5 Diabetes United States 83564 21.5
## 6 Heart disease United States 647457 165.0
To investigate the causes of death in the USA over the available time frame, we use a line graph to track their trends. The logarithm of the number of deaths is plotted so that causes of very different magnitudes can be read on the same axis. Heart disease and Cancer are the biggest killers in every year by a large margin. Although all causes show an increase in the number of deaths, this does not reflect the true trend in death rates: plotting the age-adjusted rates shows that the overall death rate from these causes is decreasing. The larger counts are most likely due to population growth, though further investigation would be required. The age-adjusted plot also shows that deaths from Heart disease and Cancer are decreasing relative to the population, while deaths from Alzheimer’s disease, Suicide and Unintentional injuries are increasing. The increase in Alzheimer’s disease is plausibly related to an aging population, but for the other two causes further data would be required to draw a conclusion.
# Stacked area chart of the age-adjusted death rate per cause
area_plot <- ggplot(usa_deaths, aes(x=Year, y=Age.adjusted.Death.Rate, fill=Cause.Name)) +
  labs(title = "Trend of death causes", x = "", y = "Death Rate / 100,000 (Age Adjusted)", fill = "Causes") +
  scale_fill_brewer(palette = "Paired") +  # the mapped aesthetic is fill, not colour
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +  # applied after theme_minimal() so it is not overwritten
  geom_area()
# Line chart of the log of the number of deaths per cause
line_plot <- ggplot(usa_deaths, aes(x=Year, y=log(Deaths), color=Cause.Name)) +
  labs(title = "Changes in death causes over the years", x = "", y = "log(Deaths)", color = "Causes") +
  scale_color_brewer(palette = "Paired") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_line()
ggplotly(area_plot)
ggplotly(line_plot)
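To illustrate why raw counts and crude rates can rise while age-adjusted rates do not, here is a minimal sketch of direct age standardization. All numbers (age bands, rates, populations, and standard weights) are hypothetical, and the helper names `crude_rate` and `adjusted_rate` are ours; the code shows the general idea, not necessarily the exact procedure behind the `Age.adjusted.Death.Rate` column.

```r
# Hypothetical age-specific death rates per 100,000, identical in both years.
rates <- c(young = 10, middle = 100, old = 1000)

# Hypothetical populations: the second year has an older age structure.
pop_year_1 <- c(young = 600000, middle = 300000, old = 100000)
pop_year_2 <- c(young = 400000, middle = 300000, old = 300000)

# Hypothetical standard-population weights (they sum to 1).
std_weights <- c(young = 0.5, middle = 0.3, old = 0.2)

# Crude rate: age-specific rates weighted by the actual population.
crude_rate <- function(rates, pop) sum(rates * pop) / sum(pop)

# Directly standardized rate: rates weighted by a fixed standard population.
adjusted_rate <- function(rates, weights) sum(rates * weights)

crude_rate(rates, pop_year_1)      # 136
crude_rate(rates, pop_year_2)      # 334
adjusted_rate(rates, std_weights)  # 235 in both years
```

Because the age-specific rates are unchanged, the adjusted rate is the same in both years, while the crude rate more than doubles purely because the population is older.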
Next we investigate the causes of death in Spain. After removing the “Other causes” type from the data set, the counts for each cause are shown in a treemap. As in the USA data set, the main causes of death are heart disease (Diseases of the circulatory system) and cancer. Alzheimer’s disease also ranks highly, as the fourth largest cause, showing that it is an issue in Spain as well.
#Get data about the top 10 diseases in the Spanish data set. Ignore "All causes" and "Other causes"
topTenSpanishDiseases <- spain_deaths %>%
  filter(GENDER == "Both", AGE == "All ages", !DISEASE %in% c("All causes", "Other causes")) %>%
  top_n(10, NUMBER.OF.DEATHS)
#Generate tree map
ggplot(topTenSpanishDiseases, aes(area=NUMBER.OF.DEATHS, fill=DISEASE, label=NUMBER.OF.DEATHS)) +
scale_fill_brewer(palette = "Paired") +
labs(title = "Most deadly diseases in Spain (2017)", fill = "Causes") +
geom_treemap() +
geom_treemap_text(fontface = "italic",
colour = "white",
place = "centre",
grow = FALSE,
reflow = TRUE)
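The top-ten selection above relies on `top_n()`, which dplyr has superseded in favor of `slice_max()`. As a sketch on made-up data (the disease names and counts below are placeholders), the equivalent call would be:

```r
library(dplyr)

# Placeholder data in the same shape as the filtered Spanish table.
toy_causes <- data.frame(
  DISEASE = c("A", "B", "C", "D"),
  NUMBER.OF.DEATHS = c(40, 10, 30, 20),
  stringsAsFactors = FALSE
)

# Keep the two rows with the largest counts; unlike top_n(),
# slice_max() also returns them sorted in descending order.
top_two <- toy_causes %>%
  slice_max(order_by = NUMBER.OF.DEATHS, n = 2)

print(top_two)
```

Like `top_n()`, `slice_max()` keeps ties by default, so it may return more than `n` rows when counts are equal.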
As the Spanish data is also broken down by gender, we use a bar graph to investigate the difference in causes of death between females and males. The two genders are affected differently by each disease. The difference in Alzheimer’s disease can plausibly be explained by the fact that, on average, women live longer than men. The reasons for the differences in the other causes are unknown to the research group and require further study.
topTenSpanishDiseasesByGenre <- spain_deaths %>% filter(GENDER != "Both") %>% filter(DISEASE != "All causes") %>% filter(AGE == "All ages") %>% filter(DISEASE %in% topTenSpanishDiseases$DISEASE)
topTenSpanishDiseasesByGenre$DISEASE <- with(topTenSpanishDiseasesByGenre, reorder(DISEASE, NUMBER.OF.DEATHS))
barplot <- ggplot(topTenSpanishDiseasesByGenre, aes(fill=DISEASE, y=NUMBER.OF.DEATHS, x=GENDER)) +
scale_fill_brewer(palette = "Paired") +
geom_bar(position=position_stack(), stat="identity", width=0.4) +
labs(x="", y = "Number of deaths (2017)")+
theme_minimal()
barplot
#ggplotly(barplot) %>% layout(bargap=0.1) #messes with the legend
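Since the stacked bars show absolute counts, the gender difference per disease can be easier to read as a share. The following sketch computes the female share of deaths for each disease; the data frame `toy_gender` and its numbers are invented for illustration and only mimic the column layout of the Spanish data set.

```r
library(dplyr)

# Invented counts in the layout of the gender-level Spanish data.
toy_gender <- data.frame(
  DISEASE = rep(c("Disease A", "Disease B"), each = 2),
  GENDER = rep(c("Females", "Males"), times = 2),
  NUMBER.OF.DEATHS = c(300, 100, 50, 150),
  stringsAsFactors = FALSE
)

# Female share of deaths per disease.
# dplyr:: prefixes guard against masking if plyr is attached after dplyr.
female_share <- toy_gender %>%
  dplyr::group_by(DISEASE) %>%
  dplyr::summarise(
    FEMALE.SHARE = sum(NUMBER.OF.DEATHS[GENDER == "Females"]) / sum(NUMBER.OF.DEATHS),
    .groups = "drop"
  )

print(female_share)
```

A share close to 0.5 would indicate a disease that affects both genders about equally in absolute terms.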
For this report we decided to investigate deaths from heart disease in more detail. Using the information available in the data set, we looked at how these deaths are distributed across gender and age, plotted as a pyramid graph.
From the data available, the largest number of heart disease deaths occurs among older women. Because the data set contains counts rather than rates, no further insights can be drawn here. Further analysis of the causes is performed in part two of the report.
#Create data frame to hold data about people who have Diseases of the circulatory system
diseasesCircSystemData <- spain_deaths
#Set up AGE factor for pyramid plot
diseasesCircSystemData$AGE <- factor(diseasesCircSystemData$AGE, c("0 to 1", "1 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85 to 89", "90 to 94", "95 or more", "All ages"))
#Filter data to keep only diseases of the circulatory system, by gender and age group, then drop the DISEASE column
diseasesCircSystemData <- diseasesCircSystemData %>% filter(DISEASE == "Diseases of the circulatory system") %>% filter(GENDER != "Both") %>% filter(AGE != "All ages") %>% dplyr::select(-DISEASE)
summary(diseasesCircSystemData)
## GENDER AGE NUMBER.OF.DEATHS
## Both : 0 0 to 1 : 2 Min. : 0.0
## Females:21 1 to 4 : 2 1st Qu.: 14.5
## Males :21 5 to 9 : 2 Median : 348.5
## 10 to 14: 2 Mean : 2274.5
## 15 to 19: 2 3rd Qu.: 2880.8
## 20 to 24: 2 Max. :13648.0
## (Other) :30
#Mutate Male data so it will be negative on bargraph
diseasesCircSystemData <- diseasesCircSystemData %>% mutate(NUMBER.OF.DEATHS = ifelse(GENDER == "Males", -1 * NUMBER.OF.DEATHS, NUMBER.OF.DEATHS))
#Generate pyramid plot
ggplot(diseasesCircSystemData, aes(x = AGE, y = NUMBER.OF.DEATHS, fill = GENDER)) +
geom_bar(data=diseasesCircSystemData[diseasesCircSystemData$GENDER == "Females",], stat = "identity") +
geom_bar(data=diseasesCircSystemData[diseasesCircSystemData$GENDER == "Males",], stat = "identity") +
scale_y_continuous(breaks = seq(-15000, 15000, 5000),
labels = paste0(as.character(c(seq(15, 0, -5), seq(5, 15, 5))), "")) +
coord_flip(ylim=c(-15000,15000)) +
scale_fill_brewer(palette = "Set1") +
labs(title = "Number of people who die from Diseases of the circulatory system", y = "Number of deaths (1,000)", x = "Age", fill = "Gender") +
theme_bw()
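The pyramid above hard-codes its axis labels to hide the negative sign on the male side. As an alternative sketch, a labelling function can derive the labels from whatever breaks ggplot chooses. The helper `thousands_label` and the data frame `toy_pyramid` are hypothetical names with made-up values.

```r
library(ggplot2)

# Made-up counts; male values are negative, as in the pyramid above.
toy_pyramid <- data.frame(
  AGE = factor(rep(c("60 to 64", "65 to 69"), times = 2),
               levels = c("60 to 64", "65 to 69")),
  GENDER = rep(c("Females", "Males"), each = 2),
  NUMBER.OF.DEATHS = c(1000, 2000, -1500, -2500)
)

# Hypothetical helper: show absolute values in thousands on the axis.
thousands_label <- function(x) as.character(abs(x) / 1000)

pyramid <- ggplot(toy_pyramid, aes(x = AGE, y = NUMBER.OF.DEATHS, fill = GENDER)) +
  geom_col() +
  scale_y_continuous(labels = thousands_label) +
  coord_flip() +
  labs(y = "Number of deaths (1,000)", x = "Age", fill = "Gender") +
  theme_bw()

pyramid
```

With this approach the labels stay correct if the axis range or the breaks change, at the cost of one small helper function.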